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Abstract 

High-tlirougliput sequencing teclinology allows researchers to test associations between phenotypes and all the 
variants identified throughout the genome, and is especially useful for analyzing rare variants. However, the 
statistical power to identify phenotype-associated rare variants is very low with typical genome-wide association 
studies because of their low allele frequencies among unrelated individuals. In contrast, a family-based design may 
have more power because rare variants are more likely to be enriched in families than among unrelated 
individuals. Regardless, an analysis of family-based association studies needs to account appropriately for 
relatedness between family members. We analyzed the observed quantitative trait systolic blood pressure as well 
as the simulated Ql data in the Genetic Analysis Workshop 18 data set using 4 tests: (a) a single-variant test, (b) a 
collapsing test, (c) a single-variant test where familial relatedness was accounted for, and (d) a collapsing test 
where familial relatedness was accounted for. We then compared the results of the 4 methods and observed that 
adjusting for familial relatedness could appropriately control the false-positive rate while maintaining reasonable 
power to detect several strongly associated variants/genes. 



Background 

Current platforms for genome-wide association studies are 
limited to scanning common variants. Although rare var- 
iants may contribute a significant proportion of heritabil- 
ity, the statistical power to detect rare variants is low 
because of their low allele frequencies. Recent advances in 
high-throughput sequencing technologies have provided 
us great opportunities to delve deeper into the genetic 
components of complex traits by identifying millions of 
rare variants in the human genome [1] and allowing them 
to be tested for associations with complex traits. 

In an effort to increase the power to detect rare-variant 
associations, many methods have been proposed to aggre- 
gate the effects of multiple rare variants within a specific 
functional unit, for example, a gene [2]. Among those 
methods, the kernel score test has enjoyed great popularity 
thanks to its flexibility and computational efficiency [3]. 
Family-based designs may offer more power, however, 
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because related individuals are more likely than unrelated 
individuals to be enriched for rare variants [4] . One chal- 
lenge of analyzing family data is that familial relatedness 
needs to be carefully adjusted for in the association analy- 
sis. Linear mixed models have been used to account for 
population stratification in the context of genome-wide 
association studies (GWAS) [5], and can also be naturally 
adapted to a family-based design. In this paper, we 
adopted a linear mixed model to adjust for familial relat- 
edness while aggregating the effects of rare variants within 
a gene using a kernel score test. 

Methods 

Data description and preprocessing 

The Genetic Analysis Workshop 18 (GAW18) data con- 
sists of genotyping and sequencing data for 959 individuals 
from 20 extended families. Among them, 849 individuals 
had at least 1 blood pressure measurement. For the real 
data, we used the first nonmissing measurement of systolic 
blood pressure (SBP) as the target quantitative trait. For 
the 200 replicates of the simulated phenotype, we also used 
the first measurement of SBP as the target quantitative 
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trait. Finally, to evaluate the false-positive rate, we used the 
simulated Ql data as the null phenotype. 

The common variants in the genotyping data were 
used to estimate the genetic similarity matrix. As for the 
sequencing data, we selected only the nonsynonymous 
mutations for the association test in recognition of the 
fact that variants causing amino acid alterations tend to 
have large effects on the phenotype. 

The model 

We used the following linear mixed model to adjust for 
familial relatedness: Y = Cy + xp + Za + e where Y repre- 
sents the phenotype (quantitative trait), C represents the 
collection of the covariates (age and gender), X includes 
the genotypes of the variants to be tested, and Z is the 
design matrix for the whole-genome polygenic random 
effects a. Finally, Y and P are fixed effects and e is the ran- 
dom residual. When the random effects are integrated out, 
the model is marginalized as 7 ~ N(Cy + XfS, Ka^ + la^), 
where and are the variances for the random 
effects a and e, respectively. As for K, one could use twice 
the theoretical kinship matrix to represent the familial 
relationships. However, to account for potential cryptic 
relatedness between individuals across different 
families, we used a genetic similarity matrix that was 
estimated from the genotyping data as follows: 
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her of minor alleles for ith individual at the mth marker, 
/,„ is the allele frequency for the wth marker, and p is the 
total number of markers in the genotyping data. 

We investigated 4 methods of association analysis: (a) a 
single-variant association test without adjusting for familial 
relatedness (UNI), that is, the component Za is excluded 
in the model; (b) a single-variant association test adjusting 
for familial relatedness (UNI-ADJ); (c) a kernel score test 
without adjusting for familial relatedness (SKAT); and (d) 
a kernel score test adjusting for familial relatedness 
(SKAT-ADJ). The first method is a simple linear regres- 
sion. The second method tests the null hypothesis of zero 
fixed effects in a linear mixed model. Specifically, we used 
the likelihood ratio test implemented in the software 
GEMMA [6] . The third method is the direct application of 
the SKAT software [3] that ignores familial relatedness. As 
for the fourth method, we adopted the approach described 
in Ref [7] . Specifically, we estimated the covariance matrix 
R = GOV (Za + e) = Ku^ + la^ under the null model with- 
out genes or variants being tested. Then we applied a data 
transformation to calculate x, and c, where: 
X = R'^I^X, X = R~^/^X, C = RT^I^C, leading to the fol- 
lowing transformed model, Y = Cy + Xp + e, where 
e = R~^l^{Za + e)- Note that each e, is independent and 



identically distributed; therefore, the kernel score test can 
be appropriately applied on the transformed data. 

Results 

Simulated data 

For the simulated data, we analyzed the genes that 
explain the highest percentages of SBP variance (the per- 
centages of SBP variance explained by each gene in the 
simulation were provided by the GAW18 organizers) 
using the 4 methods described above. Genes without 
nonsynonymous mutations were excluded. For the first 
two single-variant methods, we performed a Bonferroni 
correction on the smallest p value of the variants in a 
gene to obtain the gene-level p value. To assess the false- 
positive rate, we also calculated the p values using Ql as 
the phenotype. The power and the false-positive rate 
were calculated based on the 200 replicates. Table 1 
shows the results for the top 7 genes. As the table shows, 
ignoring familial relatedness results in inflation of the 
false-positive rate using the UNI and SKAT methods, in 
contrast with the UNI-ADJ and SKAT-ADJ methods, 
which still demonstrate good control of the false-positive 
rate. As for the power, the single-variant method does 
better for genes whose effects on the phenotype are 
dominated by a single variant (eg, LEPR) whereas the col- 
lapsing method is more powerful for genes containing 
multiple variants with comparable efi'ect sizes. 

Real data 

Consistent with our observation in the simulated data, the 
Q-Q plot (Figure 1) shows that the UNI and SKAT meth- 
ods that ignore familial relatedness result in substantial 
inflation of the false-positive rate. For UNI-ADJ, only 1 
variant (chrll:119059358) has a Bonferroni-corrected 
p value smaller than 0.05 (PBonferroni = 0.038). It is a coding 
variant for gene PDZD3. To our knowledge, the literature 
has suggested no role for this gene in blood pressure. For 
SKAT-ADJ, there is no gene with a Bonferroni-corrected 
p value smaller than 1. 

Discussion and conclusions 

In this study, we investigated the utility of linear mixed 
models in adjusting for familial structure. We also 
attempted to combine a linear mixed model method 
with a kernel score test-a state-of-the-art collapsing 
methodology that tests for association between a group 
of variants and a phenotype. 

Conclusions 

We found that linear mixed models are able to satisfac- 
torily adjust for familial relatedness in the sense that 
false-positive rates are well controlled at the nominal 
significance level. Not surprisingly, in terms of the sta- 
tistical power, the performance of the single-variant 
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Table 1 The power and false-positive rate (FPR) of the 4 methods in simulated data 
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Power 


FPR 


MAP4 


8 


1 


0.12 


1 


0.04 


1 


0.13 


1 


0.03 


TNN 


20 


0.9 


0.225 


0.74 


0.055 


0.895 


0.38 


0.825 


0.04 


LEPR 


6 


0.915 


0.125 


0.895 


0.05 


0.22 


0.085 


0.73 


0.04 


FLT3 


4 


0.66 


0.105 


0.3 


0.045 


0.33 


0.135 


0.305 


0.07 


FLNB 


13 


0.44 


0.14 


0.165 


0.03 


0.08 


0.17 


0.515 


0.035 


ZNF443 


9 


0.070 


0.155 


0.035 


0.030 


0.150 


0.075 


0.050 


0.075 


GSN 


6 


0.265 


0.155 


0.130 


0.070 


0.030 


0.065 


0.385 


0.065 


A significance level of 0.05 was used to calculate both power and FPR. 
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Figure 1 Q-Q plots for the 4 methods in real data. The Q-Q plots are based on -loglO p-values of the 7781 genes with at least 1 
nonsynonymous mutation in the real data. Red curves represent the observed p values and the black curves represent the expected p values 
under the null model. 



method and the collapsing method depends on the dis- 
tribution of the variants' effect sizes within the gene. 
The single-variant method is more powerful for genes 
with one dominating causal variant, whereas the collap- 
sing method is more powerful for genes with multiple 
causal variants of similar effect sizes. 
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